Abstract - This paper briefly presents several ways to understand the organization of a large
social network (several hundred persons). We compare approaches coming from data mining
for clustering the vertices of a graph (spectral clustering, self-organizing algorithms...)
and provide methods for representing the graph from these analyses. All these methods are
illustrated on a medieval social network and the way they can help to understand its
organization is underlined.

1 Introduction

A large number of practical applications can be modeled through what is commonly called a
"complex network". Complex networks are relational data that appear, for example, in World Wide
Web studies, in social networks or in biological studies (gene, protein and metabolite
interaction networks).

This work is based on a historical database built from the archives of Lot, a small region in
the South West of France. This database has already been presented in [2]: in a tiny
geographical location around Castelnau-Montratier, a large documentation has been collected (see
for a complete presentation). This documentation, made of about 1000 agrarian contracts
(available at http://graphcomp.univ-tlse2.fr), is a very precious source of information
about the peasants' daily life in the Middle Ages, when most written documents
concerned the well-educated part of the population. All the contracts are agrarian
transactions: they mention the name of the peasant (or peasants) involved, the names of the
lord and of the notary to whom the peasants are related, some of the neighbors of the peasants
and various other pieces of information (such as the type of transaction, the location, the
date, and so on). All the studied transactions were written between 1260 and 1340, that is, just
before the Hundred Years' War; other contracts in the database concern the period just after this war.
From this database, a relational network is built following the advice provided by the
historians (see [4]). This social network is described by a weighted graph with 615 vertices (the
peasants) and 4193 edges standing for the relations between them. The edges are weighted
by the number of relations found between two given peasants. The obtained graph is
described in [4], where it is analyzed through the comparison of an algebraic study and of a
SOM algorithm. The collaboration between mathematicians, computer scientists and
historians aims to provide historians with several tools for understanding this large and
complex network. Some of the methods developed come from statistics and data mining; they
are reviewed and illustrated in this paper. Complementary material can be found in
and complementary studies of the database are available in [3, 2].

The paper is organized as follows: Section 2 presents the problem of clustering the vertices
of a large graph and explains how this problem can help to understand the structure of the
graph. Several methods are reviewed and some of them, coming from what is called spectral
clustering, are illustrated on the medieval database. Section 3 explains how this clustering can
be used to provide a simplified representation of the graph. This leads us to use organization
algorithms designed for graphs in order to classify and organize the vertices of the graph
simultaneously: kernel SOM, described in Section 4, targets such a dual objective. Finally,
Section 5 aims to represent the whole graph from this final organizing map. Examples of
insights on the data obtained by the reviewed methods are given at each step of the analysis.

2 Clustering the vertices of the graph 

Large graphs representing complex networks are not easy to understand. One way to simplify
them, in order to highlight the main tendencies of their structure, is to find dense subgraphs
that have few connections to each other. As emphasized by [13],

reducing [the] level of complexity [of a network] to one that can be interpreted 
readily by the human eye, will be invaluable in helping us to understand the large- 
scale structure of these new network data. 

But such a clustering of the vertices of a graph faces the problem of relational data: there
is no a priori distance between two vertices and thus classical clustering algorithms, such as,
e.g., k-means, cannot be directly used. A recent survey on clustering methods adapted to
graphs is provided in [H]. Clustering the vertices of a graph is commonly addressed by the use
of a dissimilarity between vertices or by mapping the graph onto a Euclidean space; then usual
data mining tools can be used to find a convenient clustering. Recently, spectral clustering
has become a successful method of this kind (see for a very exhaustive
tutorial on this subject): spectral clustering uses the properties of the Laplacian of the graph
to understand its structure. Given a weighted graph $\mathcal{G}$ with vertices
$V = \{x_1, \ldots, x_n\}$ and edges weighted by $(w_{i,j})_{i,j=1,\ldots,n}$ (with
$w_{i,j} = w_{j,i}$ and $w_{i,i} = 0$), the Laplacian is the matrix $L$ such that

$$ L_{i,j} = \begin{cases} -w_{i,j} & \text{if } i \neq j \\ d_i = \sum_{l=1}^{n} w_{i,l} & \text{if } i = j \end{cases} $$

The Laplacian appears as a very convenient tool
for understanding the graph as its eigenvalue decomposition is directly related to the min cut
problem ("How to find a partition of the vertices that minimizes the number of cuts in the
graph?", see [24]) and to the problem of finding perfect communities, i.e., complete subgraphs
whose vertices have exactly the same neighbors (see [T^'2T]). More precisely, spectral clustering
uses the eigenvectors associated with the smallest eigenvalues of the Laplacian to map the
graph onto a Euclidean space where a k-means algorithm is performed.
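
The spectral clustering pipeline described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used for the medieval graph: it assumes the unnormalized Laplacian $L = D - W$, a plain Lloyd k-means with a deterministic farthest-point initialization (an illustrative choice), and a hypothetical toy graph made of two triangles joined by one weak edge.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W of a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def spectral_embedding(W, p):
    """Rows are vertices; columns are the eigenvectors of the p smallest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(graph_laplacian(W))  # eigenvalues in ascending order
    return eigvecs[:, :p]

def kmeans(X, k, n_iter=50):
    """Plain Lloyd k-means; returns one cluster label per row of X."""
    # Deterministic farthest-point initialization (illustrative choice).
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy graph (hypothetical data): two triangles joined by one weak edge.
W = np.zeros((6, 6))
for u, v, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[u, v] = W[v, u] = w

labels = kmeans(spectral_embedding(W, p=2), k=2)
```

On this toy graph the two triangles end up in different clusters, because the second eigenvector of the Laplacian takes opposite signs on the two sides of the weak edge.
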

But [4] notes that the spectral clustering method gives equal weights to the first p eigenvectors
of the Laplacian, whereas the smaller the eigenvalue is, the more important the corresponding
eigenvector is. Moreover, only the first p eigenvectors are used and, hence, this approach does
not use the entire structure of the graph. To avoid these problems, one can use a regularized
version of the Laplacian: the heat kernel (also called the diffusion kernel). The diffusion
matrix of the graph $\mathcal{G}$ for the parameter $\beta > 0$ is $D^\beta = e^{-\beta L}$ and
the diffusion kernel is the function $K^\beta : (x_i, x_j) \in V \times V \mapsto D^\beta_{i,j}$.
This diffusion kernel has been intensively studied and used over the past years (see [10], [11],
[15], [22], among others). One of its main desirable properties comes from Aronszajn's Theorem [J,
which states that there exist a reproducing kernel Hilbert space (RKHS), $\mathcal{H}^\beta$,
called the feature space, and a mapping function, $\phi^\beta : V \to \mathcal{H}^\beta$,
such that for all i, j,

$$ K^\beta(x_i, x_j) = \langle \phi^\beta(x_i), \phi^\beta(x_j) \rangle_{\mathcal{H}^\beta}. \qquad (1) $$

This last equation is commonly known as the kernel trick and means that $K^\beta$ is simply a
scalar product between the images by $\phi^\beta$ of the vertices of the graph. As with spectral
clustering, a k-means algorithm can be performed on this mapping. This method is known as
kernel k-means (see [Ml [7] [6]).
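
To make the diffusion kernel and the kernel trick concrete, here is a minimal sketch. It assumes the unnormalized Laplacian $L = D - W$ and computes $e^{-\beta L}$ from the eigendecomposition of $L$; the function names and the toy path graph are hypothetical. The squared distance between two mapped vertices in the feature space is obtained from kernel values only, which is what kernel k-means relies on.

```python
import numpy as np

def heat_kernel(W, beta):
    """Diffusion matrix exp(-beta * L), computed from the eigendecomposition of L = D - W."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)
    # U diag(exp(-beta * lambda)) U^T
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T

def feature_distance2(K, i, j):
    """Squared feature-space distance between vertices i and j (kernel trick)."""
    return K[i, i] + K[j, j] - 2.0 * K[i, j]

# Toy graph (hypothetical data): a path 0 - 1 - 2 with unit weights.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
K = heat_kernel(W, beta=0.05)

d01 = feature_distance2(K, 0, 1)  # neighbors
d02 = feature_distance2(K, 0, 2)  # two steps apart
```

In this example the two end vertices of the path are farther apart in the feature space than two adjacent vertices, and every eigenvector of the Laplacian contributes to the distance with a weight that decreases with its eigenvalue.
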

From a practical point of view, partitions coming from spectral clustering and kernel k-means
cannot be directly compared by way of the k-means error because the mapping of the graph
is not the same: the underlying metrics are not comparable. The same holds for partitions
coming from kernels with different values of $\beta$. To enable a comparison between all these
partitions, we use a quality measure introduced by [12], the q-modularity,
$Q_{\text{modul}} = \sum_{j=1}^{k} (e_j - a_j^2)$,
where $k$ is the number of clusters, $e_j$ is the fraction of edges in the graph that connect two
vertices in cluster $j$ and $a_j$ is the fraction of the ends of edges in the graph that are
attached to a vertex in cluster $j$. This criterion does not depend on a mapping or a dissimilarity
on the graph and is easily interpretable in terms of the probability of having within-cluster or
between-cluster edges: a high q-modularity means that vertices are clustered into dense subgraphs
having few edges between them.
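
As an illustration of this criterion, the following sketch computes the q-modularity of a partition of a weighted graph. It assumes that each undirected edge is listed once as a (u, v, weight) triple and that fractions are taken with respect to the total edge weight; the function name and the toy graph are hypothetical.

```python
def q_modularity(edges, labels):
    """q-modularity of a partition.

    edges  : list of (u, v, weight), each undirected edge listed once
    labels : dict mapping each vertex to its cluster id
    """
    total = sum(w for _, _, w in edges)
    within = {}  # weight of the edges inside each cluster
    ends = {}    # weighted number of edge ends attached to each cluster
    for u, v, w in edges:
        cu, cv = labels[u], labels[v]
        if cu == cv:
            within[cu] = within.get(cu, 0.0) + w
        ends[cu] = ends.get(cu, 0.0) + w
        ends[cv] = ends.get(cv, 0.0) + w
    q = 0.0
    for c in set(labels.values()):
        e_c = within.get(c, 0.0) / total        # fraction of edges inside cluster c
        a_c = ends.get(c, 0.0) / (2.0 * total)  # fraction of edge ends attached to c
        q += e_c - a_c ** 2
    return q

# Toy graph (hypothetical data): two triangles joined by a single edge.
edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1),
         (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 1)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}

q_two = q_modularity(edges, two_clusters)  # 2 * (3/7 - (1/2)^2) = 5/14
q_one = q_modularity(edges, one_cluster)   # 1 - 1^2 = 0
```

Splitting the toy graph into its two triangles gives a positive q-modularity, whereas putting every vertex in a single cluster gives zero, which matches the interpretation given above.
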

Table 1 summarizes the main characteristics of the partitions of the medieval graph into 50
clusters obtained by these two approaches. Both partitions share common properties: the
q-modularity is similar and the vertices are concentrated in a small number of large
clusters. Three fourths of the clusters have at most 7 vertices, whereas the largest cluster
contains more than one third of the vertices of the whole graph. In
conclusion, the partitions provided by these two simple tools have to be improved.




                                      Spectral clustering   Kernel k-means (β = 0.05)
q-modularity                          0.4195                0.4246
Number of clusters of size 1          16                    17
Maximum size of the clusters          268                   242
Median of the clusters' size          2                     2
3rd quartile of the clusters' size    7                     7

Table 1: Details about the partitions obtained by spectral clustering and kernel k-means

3 Drawing the graph

From any partition obtained by clustering the graph, a simplified representation can be
obtained by assigning a glyph to each cluster, where the surface of the glyph is proportional
to the number of vertices of the cluster. At the same time, glyphs are connected to
each other by edges whose width is proportional to the total weight of the edges between
the vertices belonging to the two corresponding clusters. Glyphs can be spatially positioned
by a force-directed algorithm, which aims at providing an aesthetic representation of a graph
by assigning forces among the edges and nodes (see [8]). Examples of such a representation
are given in Figure 1 for the two partitions described in Section 2.
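
The quantities needed for such a simplified representation can be computed as in the sketch below. It only aggregates the graph (glyph sizes and widths of the edges between glyphs); the force-directed positioning itself is left to a graph drawing tool. The function name and the toy data are hypothetical.

```python
from collections import Counter

def simplified_graph(edges, labels):
    """Aggregate a clustered graph.

    edges  : list of (u, v, weight), each undirected edge listed once
    labels : dict mapping each vertex to its cluster id
    Returns (sizes, links): number of vertices per cluster, and total weight
    of the edges between each pair of distinct clusters.
    """
    sizes = Counter(labels.values())
    links = Counter()
    for u, v, w in edges:
        cu, cv = labels[u], labels[v]
        if cu != cv:
            # Unordered pair of clusters, stored with the smaller id first.
            links[tuple(sorted((cu, cv)))] += w
    return dict(sizes), dict(links)

# Toy graph (hypothetical data): two triangles joined by two weighted edges.
edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1),
         (3, 4, 1), (3, 5, 1), (4, 5, 1),
         (2, 3, 2), (1, 4, 1)]
labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

sizes, links = simplified_graph(edges, labels)
```

Here each glyph would have a surface proportional to 3 vertices and the single edge between the two glyphs would have a width proportional to a total weight of 3.
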




Figure 1: Force-directed algorithms used for a simplified representation coming from spectral clustering
(left) and kernel k-means (right)

Both representations share common properties that can be seen as the main structural properties
of the graph: the network has a star-shaped structure with two main groups of central people
that can be seen as a kind of "rich club" (see [4]). Some tiny groups are totally isolated from
these two main central clusters and are linked to other secondary clusters. The two major clusters
are strongly linked to each other. These pictures seem to give understandable representations
of the structure of the graph but, unfortunately, the two main clusters contain about 250 and
100 vertices respectively, that is, more than half of the 615 vertices of the graph: these
two clusters are therefore almost as complex as the initial graph.

4 A clustering and organizing algorithm 

Several kernelized versions of the SOM algorithm, which can simultaneously achieve the objectives
described in Sections 2 (clustering) and 3 (representation), have been described in the
literature. The present paper uses a batch version of the kernel SOM (which generally converges
much faster) proposed in [4, 23].

The aim of self-organizing algorithms is to project the initial data onto a prior structure that
is generally a grid consisting of M neurons. A neighborhood relationship is defined on the
grid and the projection aims to preserve the initial topology of the data on this grid. The
batch kernel SOM is simply a batch SOM performed on data that have been mapped onto an
RKHS; the algorithm is rewritten by way of the kernel trick (Equation (1)).

¹ All the graph figures have been made with the free software Tulip, available at
http://www.tulip-software.org/

This algorithm has been applied to the medieval graph with a rectangular grid of size 7 × 7.
The main characteristics of the obtained partition are summarized in Table 2. It is compared
to the partition obtained by using the batch SOM on the rows of the matrix of the k eigenvectors
associated with the smallest eigenvalues (this last approach has been named "spectral SOM").
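
One common way to write a batch kernel SOM is sketched below. It is a simplified illustration, not the exact algorithm of the cited references: prototypes are assumed to be convex combinations $p_j = \sum_i \gamma_{j,i} \phi^\beta(x_i)$ of the mapped vertices, the neighborhood is a Gaussian function of the grid distance with a constant radius `sigma` (practical implementations usually decrease it over the iterations), and the function name, parameters and toy data are hypothetical.

```python
import numpy as np

def batch_kernel_som(K, grid_shape, n_iter=20, sigma=1.0, seed=0):
    """Batch SOM in the feature space of the kernel matrix K (n x n).

    Returns (winners, gamma): the neuron assigned to each vertex in the last
    assignment step, and the coefficients of the prototypes.
    """
    n = K.shape[0]
    rows, cols = grid_shape
    # Grid coordinates of the M = rows * cols neurons.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    grid_d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-grid_d2 / (2.0 * sigma ** 2))          # Gaussian neighborhood, M x M

    rng = np.random.default_rng(seed)
    gamma = rng.random((rows * cols, n))
    gamma /= gamma.sum(axis=1, keepdims=True)           # random convex combinations

    for _ in range(n_iter):
        # Assignment step: ||phi(x_i) - p_j||^2 computed from K only (kernel trick).
        cross = gamma @ K                                # cross[j, i] = <p_j, phi(x_i)>
        proto_norm = np.einsum("jl,jl->j", cross, gamma) # ||p_j||^2
        dist2 = np.diag(K)[None, :] - 2.0 * cross + proto_norm[:, None]
        winners = dist2.argmin(axis=0)                   # closest neuron for each vertex
        # Representation step: neighborhood-weighted means of the mapped vertices.
        gamma = h[:, winners]
        gamma = gamma / gamma.sum(axis=1, keepdims=True)
    return winners, gamma

# Toy usage (hypothetical data): heat kernel of two triangles joined by one edge.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0
lam, U = np.linalg.eigh(np.diag(W.sum(axis=1)) - W)
K = (U * np.exp(-0.05 * lam)) @ U.T                      # exp(-beta * L), beta = 0.05

winners, gamma = batch_kernel_som(K, grid_shape=(1, 2))
```

The vertices never need explicit coordinates: both steps use only the entries of the kernel matrix, which is why the same code applies to any positive kernel on the graph.
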





                                      Spectral SOM   Kernel SOM (β = 0.05)
q-modularity                          0.433          0.551
Final number of clusters              29             35
Number of clusters of size 1          11             13
Maximum size of the clusters          325            255
Median of the clusters' size          2              3
3rd quartile of the clusters' size    10             10

Table 2: Details about the clustering obtained by spectral SOM and batch kernel SOM

The corresponding simplified representations, respecting the topology of the map, are
provided in Figure 2, where additional information is given by the U-matrices (see [20]), which
smoothly represent the mean distances between the prototypes of each cluster (in the space
generated by the k eigenvectors associated with the smallest eigenvalues and in the feature
space, respectively). Kernel SOM clearly provides a better clustering than spectral SOM (larger
q-modularity, far fewer vertices in the largest cluster). It also seems to be slightly better
than spectral clustering and kernel k-means (larger q-modularity, smaller number of
tiny clusters, i.e., clusters with fewer than 5 vertices).

In both cases, the simplified representations are well organized and easy to understand.
Compared to Figure 1, the representation provided by kernel SOM is very close to the one
provided by a simple clustering followed by a force-directed representation of the clusters:
a large cluster has a central position and is surrounded by smaller clusters. But looking at
the U-matrix, the map is clearly divided into three main parts (top left, top right and bottom
right) which, according to the color levels, are distant from each other.

This fact is explained by Figure 3 (left), where the significance of each cluster clearly
appears: the top left part of the map is the oldest cluster whereas the top right part is the
youngest, with a continuous progression of the dates on the map. Figure 3 (right) also provides
some interesting information about the social network: in particular, the top left part of the
map has a homogeneous geographical setting, which is the small village of Divilhac. This
part of the map is linked with the large clusters at the bottom right only by a single peasant.
This peasant does not live in this village but in the dominant village of the clusters to which
he is linked at the bottom right of the map (St Julien). It thus seems that generational
and geographical relationships are very important in this network. Moreover, it is
surprising to see how even large clusters are homogeneous in their geographical settings
(see the top right and bottom right clusters for example).

² Colored and high quality images can be found at http://nathalie.vialaneix.free.fr/maths/article-normal.php3?id_artic
³ Readers interested in the location of the villages named in this paper will find an approximate map at
http://maps.google.com/maps/ms?ie=UTF8&hl=fr&msa=0&msid=100355826667676777753.000001134e74760eae6cd&z=10



5 Representing the whole graph from the kernel Self-Organizing Map

As the layout used for Figure 2 has been built to be well organized, it provides an interesting
starting point for a readable presentation of the whole graph. In [19], [lEl, Truong et al.
developed an energy model, in the spirit of force-directed algorithms, but under location
constraints. This model is intended to represent graphs that are already clustered. By applying
this algorithm to the self-organizing map presented in Figure 2, we obtain the representation
of Figure 4, where the names of some peasants in the smallest clusters have been added.

Although the representation of the bottom right part of the map clearly still has to
be improved, some important facts that seem to be of interest for historians are
emphasized. First of all, the peasant who links the top left part of the map to the bottom
right one is "Pierre Fornie". This man is already known by historians to be a major character.
This name also appears a little apart from the main clusters and was identified by historians
as the same person (and not a namesake), which means that some ambiguities still exist in
the database (database correction is currently underway, partly as a result of this first analysis).
Moreover, the persons having a geographical setting different from the rest of the top left
part of the map (Pern and Ganic instead of Divilhac) are named Trapas and Tessendier.
They also belong to families known for their dominant positions. Then, some clusters that are
roughly homogeneous from the geographical point of view are connected to similar clusters via
important families that do not live in the same area and that can be seen as important links
between villages. Finally, the top right part of the map is linked to the bottom right one by a
single family, the Aliquier family, which leads us to identify this family as being very
important for the social cohesion of the network.

Figure 3: Mean dates of each cluster with standard deviation in parentheses (left) and distribution
of the geographical settings of each cluster (right)

Figure 4: Representation of the whole medieval graph coming from kernel SOM

All these remarks have helped historians to understand the organization of the social network.
Moreover, dominant families have been identified through this first study: the next objective
of this project is to understand how they structured the society and how they
evolved through the hard break of the Hundred Years' War.

This work would not have existed without the ANR Graph-Comp team. The authors thank Bertrand Jouve,
the project coordinator, for this very interesting subject and for all the discussions about it. We also want to thank
Romain Boulet, Taofiq Dkaki and Pascale Kuntz for helpful discussions, Fabien Picarougne and Bleuenn Le
Goffic, who entirely created and managed the database, and, of course, Florent Hautefeuille, historian at UMR
TRACES (University of Toulouse Le Mirail), who provided us with helpful comments and analysis.